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Abstract 

We study the computational complexity of Markov chain Monte Carlo (MCMC) meth¬ 
ods for high-dimensional Bayesian linear regression under sparsity constraints. We first 
show that a Bayesian approach can achieve variable-selection consistency under relatively 
mild conditions on the design matrix. We then demonstrate that the statistical criterion 
of posterior concentration need not imply the computational desideratum of rapid mixing 
of the MCMC algorithm. By introducing a truncated sparsity prior for variable selection, 
we provide a set of conditions that guarantee both variable-selection consistency and rapid 
mixing of a particular Metropolis-Bastings algorithm. The mixing time is linear in the 
number of covariates up to a logarithmic factor. Our proof controls the spectral gap of 
the Markov chain by constructing a canonical path ensemble that is inspired by the steps 
taken by greedy algorithms for variable selection. 


1 Introduction 

In many areas of science and engineering, it is common to collect a very large number of 
covariates Xi,... ,Xp in order predict a response variable Y. We are thus led to instances 
of high-dimensional regression, in which the number of covariates p exceed the sample size 
re. A large literature has emerged to address problems in the regime p S> re, where the ill- 
posed nature of the problem is addressed by imposing sparsity conditions—namely, that the 
response Y depends only on a small subset of the covariates. Much of this literature is based on 
optimization methods, where penalty terms are incorporated that yield both convex [33] and 
nonconvex PEH] optimization problems. Theoretical analysis is based on general properties 
of the design matrix and the penalty function. 

Alternatively, one can take a Bayesian point of view on high-dimensional regression, plac¬ 
ing a prior on the model space and performing the necessary integration so as to obtain a 
posterior distribution [HKiniis]. Obtaining such a posterior allows one to report a subset of 
possible models along with their posterior probabilities as opposed to a single model. One 
can also report the marginal posterior probability of including each covariate. Some recent 
work has provided some theoretical understanding of the performance of Bayesian approaches 
to variable selection. In the moderate-dimension scenario (in which p is allowed to grow with 
re but p < n), Shang and Clayton pH] establish posterior consistency for variable selection 
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in a Bayesian linear model, meaning that the posterior probability of the true model that 
contains all influential covariates tends to one as n grows. Narisetty and He [26] consider a 
high-dimensional scenario in which p can grow nearly exponentially with n; in this setting, 
they show the Bayesian spike-and-slab variable-selection method achieves variable-selection 
consistency. Since this particular Bayesian method resembles a randomized version of Iq- 
penalized methods, it could have better performance than £i-penalized methods for variable 
selection under high-dimensional settings |26[ I29j . Empirical evidence for this conjecture is 
provided by Guan et al. m for SNP selection in genome-wide association studies, but it has 
not been confirmed theoretically. 

The most widely used tool for fitting Bayesian models are sampling techniques based on 
Markov chain Monte Carlo (MCMC), in which a Markov chain is designed over the param¬ 
eter space so that its stationary distribution matches the posterior distribution. Despite its 
popularity, the theoretical analysis of the computational efficiency of MCMC algorithms lags 
that of optimization-based methods. In particular, the central object of interest is the mixing 
time of the Markov chain, which characterizes the number of iterations required to converge 
to an e-distance of stationary distribution from any initial configuration. In order for MCMC 
algorithms to be controlled approximations, one must provide meaningful bounds on the mix¬ 
ing time as a function of problem parameters such as the number of observations and the 
dimensionality. Of particular interest is determining whether the chain is rapidly mixing — 
meaning that the mixing time grows at most polynomially in the problem parameters—or 
slowly mixing meaning that the mixing time grows exponentially in the problem parameters. 
In the latter case, one cannot hope to obtain approximate samples from the posterior in any 
reasonable amount of time for large models. 

Unfortunately, theoretical analysis of mixing time is comparatively rare in the Bayesian 
literature, with a larger number of negative results. On the positive side, Jones and Robert m 
consider a Bayesian hierarchical version of the one-way random effects model, and obtain 
upper bounds on the mixing time of Gibbs and block Gibbs samplers as a function of the initial 
values, data and hyperparameters. Belloni and Chernozhukov |1| show that a Metropolis 
random walk is rapidly mixing in the dimension for regular parametric models in which the 
posterior converges to a normal limit. It is more common to find negative results in the 
literature. Examples include Mossel and Vigoda |25) . who show that the MCMC algorithm 
for Bayesian phylogenetics takes exponentially long to reach the stationary distribution as 
data accumulates, and Woodard and Rosenthal |3Z|, who analyze a Gibbs sampler used for 
genomic motif discovery and show that the mixing time increases exponentially as a function 
of the length of the DNA sequence. 

The goal of the current paper is to study the computational complexity of Metropolis- 
Hastings procedures for high-dimensional Bayesian variable selection. For concreteness, we 
focus our analysis on a specific hierarchical Bayesian model for sparse linear regression, and 
an associated Metropolis-Hastings random walk, but these choices should be viewed as rep¬ 
resentative of a broader family of methods. In particular, we study the well-known Zellner 
g-prior for linear regression [38|. The main advantage of this prior is the simple expression 
that it yields for the marginal likelihood, which is convenient in our theoretical investigations. 
As in past analyses |26| , we consider the marginal probability of including each covariate into 
the model as being on the order of Moreover, we restrict the support of the prior 

to rule out unrealistically large models. As a specific computational methodology, we focus 
on an iterative, local-move and neighborhood-based procedure for sampling from the model 
space, which is motivated by the shotgun stochastic search [TTj . 

Our main contribution is to provide conditions under which Bayesian posterior consistency 
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holds, and moreover, the mixing time grows linearly in p (np to logarithmic factor), implying 
that the chain is rapidly mixing. As a by-product, we provide conditions on the hyper¬ 
parameter g to achieve model-selection consistency. We also provide a counter-example to 
illustrate that although ruling out unrealistically large models is not necessary for achieving 
variable-selection consistency, it is necessary in order that the Metropolis-Hastings random 
walk is rapidly mixing. To be clear, while our analysis applies to a fully Bayesian procedure 
for variable selection, it is based on a frequentist point of view in assuming that the data are 
generated according to a true model. 

There are a number of challenges associated with characterizing the computational com¬ 
plexity of Markov chain methods for Bayesian models. First, the posterior distribution of a 
Bayesian model is usually a much more complex object than the highly structured distribu¬ 
tions in statistical physics for which meaningful bounds on the Markov chain mixing times are 
often obtained (e.g. m, m, |21)1. Second, the transition probabilities of the Markov chain 
are themselves stochastic, since they depend on the underlying data-generating process. In 
order to address these challenges, our analysis exploits asymptotic properties of the Bayesian 
model to characterize the typical behavior of the Markov chain. We show that under con¬ 
ditions leading to Bayesian variable-selection consistency, the Markov chain over the model 
space has a global tendency of moving towards the true data-generating model, even though 
the posterior distribution can be highly irregular. In order to bound the mixing time, we make 
use of the canonical path technique developed by Sinclair mm and Diaconis and Stroock 
[8j . More precisely, the particular canonical path construction used in our proof is motivated 
by examining the solution path of stepwise regression procedures for linear model selection 
(e.g., mw, where a greedy criterion is used to decide at each step whether a covariate is 
to be included or deleted from the curent model. 

Overall, our results reveal that there is a delicate interplay between the statistical and 
computational properties of Bayesian models for variable selection. On the one hand, we 
show that concentration of the posterior is not only useful in guaranteeing desirable statistical 
properties such as parameter estimation or model-selection consistency, but they also have 
algorithmic benefits in certifying the rapid mixing of the Markov chain methods designed to 
draw samples from the posterior. On the other hand, we show that posterior consistency on 
its own is not sufficient for rapid mixing, so that algorithmic efficiency requires somewhat 
stronger conditions. 

The remainder of this paper is organized as follows. Section provides background on 
the Bayesian approach to variable selection, as well as Markov chain algorithms for sampling 
and techniques for analysis of mixing times. In Section we state our two main results 
(Theoremsandfor a class of Bayesian models for variable selection, along with simulations 
that illustrate the predictions of our theory. Section]^ is devoted to the proofs of our results, 
with many of the technical details deferred to the appendices. We conclude in Section with 
a discussion. 


2 Background and problem formulation 

In this section, we introduce some background on the Bayesian approach to variable selec¬ 
tion, as well some background on Markov chain algorithms for sampling, and techniques for 
analyzing their mixing times. 
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2.1 Variable selection in the Bayesian setting 

Consider a response vector Y G and a design matrix X G that are linked by the 

standard linear model 


Y = X/3* + w, where w ~ AA(0, cr'^In), 


( 1 ) 


and /?* G is the unknown regression vector. Based on observing the pair {Y,X), our goal 
is to recover the support set of /3*—that is, to select the subset of covariates with non-zero 
regression weights, or more generally, a subset of covariates with absolute regression weights 
above some threshold. 

In generic terms, a Bayesian approach to variable selection is based on first imposing a 
prior over the set of binary indicator vectors, and then using the induced posterior (denoted 
by 7 r (7 | Y)) to perform variable selection. Here each binary vector 7 G {0,1}^ should be 
thought of as indexing the model which involves only the covariates indexed by 7 . We make 
use of the shorthand I 7 I = Ij corresponding to the number of non-zero entries in 7 , 

or the number of active covariates in the associated model. It will be convenient to adopt a 
dualistic view of 7 as both a binary indicator vector, and as a subset of {1,... ,p}. Under this 
identification, the expression 7 C 7 ' for a pair of inclusion vectors ( 7 , 7 ^ can be understood 
as that the subset of variables selected by 7 is contained in the subset of variables selected 
by yh Similarly, it will be legitimate to use set operators on those indicator vectors, snch as 
yPly', 7 U 7 ' and ■ Using the set interpretation, we let X^ G denote the submatrix 

formed of the columns indexed by 7 , and we define the subvector j3^ G in an analogous 
manner. We make use of this notation in defining the specific hierarchical Bayesian model 


analyzed in this paper, defined precisely in Section 3.1 to follow 


2.2 MCMC algorithms for Bayesian variable selection 

Past work on MCMC algorithms for Bayesian variable selection can be divided into two main 
classes—Gibbs samplers (e.g., [a Haile]) and Metropolis-Hastings random walks (e.g. m 
HD). In this paper, we focus on a particular form of Metropolis-Hastings updates. 

In general terms, a Metropolis-Hastings random walk is an iterative and local-move based 
procedure involving three steps: 

Step 1 : Use the current state 7 to define a neighborhood X'{'y) of proposal states. 

Step 2 : Choose a proposal state 7 ' in Af{'y) according to some probability distribution S( 7 , •) 
over the neighborhood, e.g. the uniform distribution. 


Step 3: Move to the new state 7 ' with probability R( 7 , 7 '), and stay in the original state 7 
with probability 1 — R( 7 , 7 '), where the acceptance ratio is given by 


R(7,7') 


= min |l, 


TTnjj' I U)S(y,7) ^ 
7r„(7 I Y) S(7,7') 


( 2 ) 


In this way, for any fixed choice of the neighborhood structure we obtain a Markov 

chain with transition probability given by 


Pmh(7) 7O 


S(7,y)R(7,7') ifyeAA(7), 

< 0 if 7 ^ ^ ■^il) U { 7 }, and 

.1 - E 7 Y 7 ^mh( 7 , 7 ') if 7 '= 7 . 
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The specific form of Metropolis-Hastings update analyzed in this paper is obtained by 
randomly selecting one of the following two schemes to update 7, each with probability 0 . 5 . 

Single flip update: Choose an index j G [p] uniformly at random, and form the new state 
7' by setting 7' = 1 — 7^. 

Double flip update: Define the subsets 5(7) = {j G \p\ \ 7^ = 1 } and let S^{'y) = {j G 
\p] I 7i = 0 }- Choose an index pair {k,£) G S{'y) x S^{'y) uniformly at random, and 
form the new state 7' by flipping 7^ from 1 to 0 and 7^ from 0 to 1 . (If the set S{'j) is 
empty, then we do nothing.) 

This scheme can be understood as a particular of the general Metropolis-Hastings scheme in 
terms of a neighborhood M{'y) to be all models that can be obtained from 7 by either 
changing one component to its opposite (i.e., from 0 to 1, or from 1 to 0) or switching the 
values of two components with different values. 

Letting dni'y, 7O = ^( 7 i / Ij) denote the Hamming distance between 7 and 7'. the 

overall neighborhood is given by the union : = Miiy) U A/2(7), where 


■/Vi( 7 ) ■ = [i I = 1}, and 

A/ 2 ( 7 ) := {7 I dnii,1) = 2, and 3 (A:,£) G S{'y) x S^{'y) s.t. 7^, = 1 - 7^. and 7 ( = 1 -'yi]- 


With these definitions, the transition matrix of the previously described Metropolis- 
Hastings scheme takes the form 


Pmh(7)7 ) 


< 


2p 


min 11, 


TnlVW) 1 . 

vr„(7|V) J ’ 


2 15 ( 7 ) 11577)1 

0 


7r„(7|V) /’ 


if 7' G Ni (7) 
if 7' G A/2 (7) 
if dniy', 7) > 2, and 


U “ E7'77Pmh( 7,7')> if 7'= 7. 


( 3 ) 


2.3 Background on mixing times 

Let C be an irreducible, aperiodic Markov chain on the discrete state space and de¬ 
scribed by the transition probability matrix P G with stationary distribution tt. 

We assume throughout that C is reversible; i.e., it satisfies the detailed balance condition 
7r(7)P(7,7') = 7r(7')P(7', 7) for all 7,7' G It is easy to see that the previously described 
Metropolis-Hastings matrix Pmh satisfies this reversibility condition. It is convenient to iden¬ 
tify a reversible chain with a weighted undirected graph G on the vertex set where two 
vertices 7 and 7' are connected if and only if the edge weight Q( 7,70 := ^( 7 )P( 7)70 is 
strictly positive. 

For 7 G ^ and any subset S C we write P(7,5) = X]7'e5 ^( 7 ’ 70 - if 7 i® fii® 
initial state of the chain, then the total variation distance to the stationary distribution after 
t iterations is 


A^{t) = ||P”(7, •) -vr(-)||ri/ := max |P’"(7,5) -7r(5)|. 


The e-mixing time is given by 


Te : = max min |t G N I Ay(t') < e for all t' >t}, ( 4 ) 

76.^ 


5 



which measures the number of iterations required for the chain to be within distance e G (0,1) 
of stationarity. The efficiency of the Markov chain can be measured by the dependence of 
on the difficulty of the problem, for example, the dimension of the parameter space and the 
sample size. In our case, we are interesed in the dependence of on the covariate dimension p 
and the sample size n. Of particular interest is whether the chain is rapidly mixing^ meaning 
that the mixing time grows at most polynomially in the pair (p, n), or slowly mixing, meaning 
that the mixing time grows exponentially. 


3 Main results and their consequences 

The analysis of this paper applies to a particular family of hierarchical Bayes models for 
variable selection. Accordingly, we begin by giving a precise description of this family of 
models, before turning to statements of our main results and a discussion of their consequences. 
Our first result (Theorem provides sufficient conditions for posterior concentration, whereas 
our second result (Theorem]^ provides sufficient conditions for rapid mixing of the Metropolis- 
Bastings updates. 


3.1 Bayesian hierarchical model for variable selection 

In addition to the standard linear model ([^, the Bayesian hierarchical model analyzed in this 
paper involves three other ingredients: a prior over the precision parameter (p (or inverse noise 
variance) in the linear observation model, a prior on the regression coefficients, and a prior 
over the binary indicator vectors. More precisely, it is given by 


Linear model: 
Precision prior 
Regression prior 
Sparsity prior 


Y = + w, u; ~ AA(0, p 

Trip) oc i 

4 > 

(/3^ I 7 ) ~AA(0,5r'(^7^7)“') 

/1 \ 

7r(7) ot (-) Illbl^so]- 


(5a) 

(5b) 

(5c) 

(5d) 


For each model M.y, there are three parameters to be specified: the integer sq < n is a 
prespecified upper bound on the maximum number of important covariates, the hyperpa¬ 
rameter > 0 controls the degree of dispersion in the regression prior, and the hyperpa¬ 
rameter «: > 0 penalizes models with large size. For a given integer sq £ {1; • ■ • iP}) we let 
= {M 7 I I 7 I < So} the class of all models involving at most sq covariates. 

Let us make a few remarks on our choice of Bayesian model. First, the choice of covariance 
matrix in the regression prior—namely, involving —is made for analytical convenience, 

in particular in simplifying the posterior. A more realistic choice would be the independent 
prior 


/ 3 ^ I 7~AA(0,ff(/> ^/|^|). 

However, the difference between these choices will be negligible when g ^ n, which, as shown 
by our theoretical analysis, is the regime under which the posterior is well-behaved. Another 
popular choice for the prior of 13-f is the spike-and-slab prior [16] , where for each covariate Xj , 
one specifies the marginal prior for Pj as a mixture of two normal distributions, one with a 
substantially larger variance than the other, and •jj can be viewed as the latent class indicator 
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for this mixture prior. Our primary motivation in imposing Zellner’s ^r-prior is in order to 
streamline the theoretical analysis: it leads to an especially simple form of the marginal 
likelihood function. However, we note that our conclusions remain valid under essentially 
the same conditions when the independent prior or the spike-and-slab prior is used, but with 
much longer proofs. The sparsity prior on 7 is similar to the prior considered by Narisetty and 
He [ 2 ^ and Castillo et al. [?]• The p~'^ decay rate for the marginal probability of including 
each covariate imposes a vanishing prior probability on the models of diverging sizes. The only 
difference is that we put a constraint I7I < sq to rule out models with too many covariates. 

As will be clarified in the sequel, while this additional constraint is not needed for Bayesian 
variable-selection consistency, it is necessary for rapid mixing of the MCMC algorithm that 
we analyze. 

Recall from our earlier set-up that the response vector Y G M” is generated from the stan¬ 
dard linear model Y = Xf 3 * + w, where w r-u A/'(0,crg/n), / 3 * G is the unknown regression 
vector, and do the unknown noise standard deviation. In rough terms, the goal of variable 
selection is to determine the subset S of “influential” covariates. In order to formalize this 
notion, let us fix a constant Cjs > 0 depending on (do, n,p) that quantifies the minimal signal 
size requirement for a covariate to be “influential”. We then define S = S^Cjs) to consist of 
the indices with relatively large signal—namely 

S:={je\p]\ Wj\>Cp}, (6) 

and our goal is to recover this subset. Thus, the “non-influential” coefficients / 3 ^c are allowed 
to be non-zero, but their magnitudes are constrained. 

We let 7* be the indicator vector that selects the influential covariates, and let s* := |7*| 
be the size of the corresponding “true” model M.^*. Without loss of generality, we may assume 
that the hrst s* components of 7* are ones, and the rest are zeros. We assume throughout this 
section that we are in the high-dimensional regime where p > n, since the low-dimensional 
regime where re < p is easier to analyze. For any symmetric matrix Q, let An,in(Q) and Aniax(Q) 
denote its smallest and largest eigenvalues. Our analysis involves the following assumptions: 

Assumption A (Conditions on / 3 *): The true regression vector has components ( 3 * = (/SJ, ( 5 gc) 
that satisfy the bounds 

Full / 3 * condition: <5*^0 —^ 

"^n re 

1 1 

Off-support condition: || ^^A5c/3cc||^ < Lcr^ — 


for some universal constant L. 

In the simplest case, the true regression vector P* is S'-sparse (meaning that I 3 gc = 0 ), so 
that the off-support condition holds trivially. As for the full ( 3 * condition, it is known | 28 ) that 
some form of upper bound on the norm ||/3*||2 in terms of the o'-hyperparameter is required 
in order to prove Bayesian model selection consistency [ 2 ^. The necessity of such a condition 
is a manifestation of the so-called information paradox of g-priors [22j . 

Our next assumption involves an integer parameter s, which is set either to a multiple of 
the true sparsity s* (in order to prove posterior concentration) or the truncated sparsity sq 
( in order to prove rapid mixing). 
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Assumption B (Conditions on the design matrix): The design matrix has been nor¬ 
malized so that = n for all j = 1 ,... ,p; moreover, letting Z ~ N{Q,In)-, there exist 

constants v > 0 and L < oo such that 


Lower restricted eigenvalue (RE(s)): 
Sparse projection condition (SI(s)): 


/I j. 

mm Amin -X 
|7|<s Vn ' 


Ez 


max max 

|7|<sfce[p]\7 


1 

^/n 


I > v, and 


< ^y/Lulogp, 
( 7 b) 


where < 1 >^ denotes projection onto the span of {Xj,j G 7}. The lower restricted eigenvalue 
condition is a mild requirement, and one that plays a role in the information-theoretic limita¬ 
tions of variable selection [ 33 ]. On the other hand, the sparse projection condition can always 
be satisfied by choosing L = O{so). To see this, notice that — <l>.y)Afc|| < 1 and there 

are at most different choice of distinct pair (7,/c). Therefore, by the Gaussianity of qg, 
the sparse projection condition always holds with L = On the other extreme, if the 

design matrix X has orthogonal columns, then (l — ^y)Xk = Xk- As a consequence, due 
to the same argument, the sparse projection condition holds with L = which depends 

neither on s* nor on sq- 


Assumption C (Choices of prior hyperparameters): The noise hyperparameter g and 
sparsity penalty hyperparameter k > 0 are chosen such that 

^ 2a gQj^g a > 0, and 

( 7 c) 

K + a > Ci{L + L) + 2 for some universal constant Ci > 0 . 

In the low-dimensional regime p = o(n), the ^r-prior with either the unit information prior 
g = n, or the choice g = max{n,p^} have been recommended dSKiniEl]. In the intermediate 
regime where p = 0 {n), Sparks et al. [ 32 ] show that g must grow faster than pn~^ logp for 
the Bayesian linear model without variable selection to achieve posterior consistency. These 
considerations motivate us to choose the hyperparameter for the high-dimensional setting as 
g X for some a > 0, and our theory establishes the utility of this choice. 


Assumption D (Sparsity control): For a constant Cq > 4 , one of the two following 
conditions holds: 


Version D(s*): We set sq : = p in the sparsity prior ( 5 d), and the true sparsity s* is 
— sCqK {Ic^ “ IGLctqI for some constant A > 4 -|- a -|- cL. 


bounded as s* < 


Version D(so): The sparsity parameter sq in the prior ( 5 d) satisfies the sandwich relation 

1 


(2z^ ^a;(A)-|-l)s* < so < 
where U}{X):= ma.x\l(X^X^)-^X^X^*\Jf. 


n 


SCo K I logp 


— I6L(Tq 




( 7 d) 




Assumptions A, B, C and D are a common set of conditions assumed in the existing 
literature (e.g., [ 28 l| 26 ]) for establishing Bayesian variable-selection consistency; i.e., that the 
posterior probability of the true model 7rn(7*|y) —)• 1 as n —>■ 00. 










3.2 Sufficient conditions for posterior consistency 


Our first result characterizes the behavior of the (random) posterior 7r„(- | Y). As we men¬ 
tioned in Section 2.1, Bayesian variable-selection consistency does not require that the sparsity 
prior (5d) be truncated at some sparsity level much less than p, so that we analyze the hier¬ 
archical model with sq = P; and use the milder Assumption D(s*). The reader should recall 
from equation Q the threshold parameter Cp that defines the subset S = S{Cp) of influential 
covariates. 


Theorem 1 (Posterior concentration). Suppose that Assumption A, Assumption B with 
s = Ks*, Assumption C, and Assumption D{s*) hold. If the threshold Cp satisfies 


Cjj > C(){L L CX k) CTg 


2 logp 


n 


( 8 ) 


then we have 'Knil* \Y) ^ ^ — ciP ^ with probability at least 1 — C 2 P 


The threshold condition Q requires the set of influential covariates to have reasonably 
large magnitudes; this type of signal-to-noise condition is needed for establishing variable 
selection consistency of any procedure [3l]. We refer to it as the /J^in-condition in the rest of 
the paper. Due to the mildness of Assumption A (conditions on /3*), the claim in the theorem 
holds even when the true model is not exactly sparse: Assumption A allows the residual /3^c 
to be nonzero as long as it has small magnitude. 

It is worth noting that the result of Theorem covers two regimes, corresponding to 
different levels of signal-to-noise ratio. More precisely, it is useful to isolate the following two 
mutually exclusive possibilities: 


High SNR: 

S = {j ^ \p] 

/3* 7 ^ 0| and ram.\l5*A^ > co{a + k + L) , 

(9a) 

Low SNR: 

S' = 0 and 

AJn ^ V Cl J n 

(9b) 

In terms of the parameter L in 

Assumption A, The high SNR regime corresponds to L 

= 0, 


whereas the low SNR regime corresponds to L = — L. The intuition for the low 

SNR setting is that the signal in every component is so weak that the “penalty” induced by 
hyperparameters (p,k) completely overwhelms it. Theorem guarantees that the posterior 
concentrates around the model M^,* under the high SNR condition, and under the null model 
Mg,Q under the low SNR condition. More precisely, we have: 

Corollary 1. Under the conditions of Theorem\^ with probability at least 1 — C 2 P~^^: 


(a) Under the high SNR condition (9a), we have vr„( 7 * | T) > 1 — cip 


-1 


-1 


(b) Conversely, under the low SNR condition (9b), we have 'n'ni'Io \ Y) >1 — cip 

Corollary provides a complete characterization of the high or low SNR regimes, but it 
does not cover the intermediate regime, in which some component of fi* is sandwiched as 


a + K — 2 




-L 


(Tn 


jlogp 


< <CQ{a + K + L)al 


\ogp 


( 10 ) 


On one hand. Theorem still guarantees a form of Bayesian variable selection consistency 
in this regime. However, the MCMC algorithm for sampling from the posterior can exhibit 
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slow mixing due to multimodality in the posterior. In Appendix A.2 we provide a simple 


example that satisfies the conditions of Theorem so that posterior consistency holds, but 
the Metropolis-Hastings updates have mixing time growing exponentially in p. This example 
reveals a phenomenon that might seem counter-intuitive at hrst sight: sharp concentration of 
the posterior distribution need not lead to rapid mixing of the MCMC algorithm. 


3.3 Sufficient conditions for rapid mixing 

With this distinction in mind, we now turn to developing sufficient conditions for Metropolis- 
Hastings scheme Q to be rapidly mixing. As discussed in Section]^ this rapid mixing ensures 
that the number of iterations required to converge to an e-ball of the stationary distribution 
grows only polynomially in the problem parameters. The main difference in the conditions 
is that we now require Assumption B—the RE and sparse projection conditions—to hold 
with parameter s = sq, as opposed to with the smaller parameter s = Ks* sq involved in 
Theorem [TJ 


Theorem 2 (Rapid mixing guarantee). Suppose that Assumption A, Assumption B with 
s = So, Assumption C, and Assumption D(sq) all hold. Then under either the high SNR 
condition (9a) or the low SNR condition (|9b|), there are universal constants ci,C 2 such that, 


for any e G (0,1), the e-mixing time of the Metropolis-Hastings chain (|^ is upper bounded as 

Te < Cl psl {c 2 a {n + So) logp -h log(l/e) -h 2) (11) 


with probability at least I — Ap . 

According to our previous definition @ of the mixing time, Theorem characterizes 
the worst case mixing time, meaning the number of iterations when starting from the worst 
possible initialization. If we start with a good intial state—for example, the true model 
7 * would be a nice though impractical choice—then we can remove the n term in the upper 
bound 0 - In this way, the term C 1 C 2 Q; npsQ logp can be understood as the worst-case number 
of iterations required in the burn-in period of the MCMC algorithm. 

Theoremj^and Theoremj^lead to the following corollary, stating that after 0{a nps^ logp) 
iterations, the MCMC algorithm will output 7 * with high probability. 

Corollary 2. Under the conditions of Theorem\^ for any fixed iterate t such that 

t>ci psl (c 2 a (n -h So) logp -h logp-h 2), 


the iterate 'jt from the MCMC algorithm matches 7 * with probability at least 1 — C 2 P 


As with Corollary Theorem does not characterize the intermediate regime in which 
some component /ft of /3* satisfies the sandwich inequality (10). Based on our simulations, 
we suspect that the Markov chain might be slowly mixing in this regime, but we do not have 
a proof of this statement. 


3.4 Illustrative simulations 

In order to illustrate the predictions of Theorem we conducted some simulations. We also 
provide an example for which a frequentist method such as the Lasso fails to perform correct 
variable selection while our Bayesian method succeeds. 
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3.4.1 Comparison of mixing times 

In order to study mixing times and their dependence on the model structure, we performed 
simulations for linear models with random design matrices, formed by choosing row Xi G 
i.i.d. from a multivariate Gaussian distribution. In detail, setting the noise variance = 1, 
we considered two classes of linear models with random design matrices X G in each 

case formed with i.i.d. rows Xi G 

Independent design: Y ^ J\f {X/3*, In) with Xi ~ AA(0,/p) i.i.d.; 

Correlated design: Y ~ M{X/3*, In) with Xi ~ M{0, S) i.i.d. and Yjk = 

In all cases, we choose a design vector /3* G with true sparsity s* = 10, taking the form 

P* = SNR (2, -3, 2, 2, -3, 3, -2, 3, -2, 3, 0, • • • , O)^ G W, 

where SNR > 0 is a signal-to-noise parameter. Varying the parameter SNR allows us 
to explore the behavior of the chains when the model lies on the boundary of the fimin- 
condition. We performed simulations for the SNR parameter SNR G {0.5,1,2,3}, sample 
sizes n G {300,900}, and number of covariates p G {500,5000}. In all cases, we specify our 
prior model by setting the dispersion hyperparameter g = and the expected maximum 
model size sq = 100 . 

Figure plots the typical trajectories of log-posterior probability versus the number of 
iterations of the Markov chain under the independent design. In the strong signal regime 
(SNR = 3), the true model receives the highest posterior probability, and moreover the 
Metropolis-Hastings chain converges rapidly to stationarity, typically within 3p iterations. 
This observation is confirmation of our theoretical prediction of the behavior when all nonzero 
components in /I* have relative high signal-to-noise ratio (S' = {j : /3j ^ 0}). In the interme¬ 
diate signal regime (SNR = I), Bayesian variable-selection consistency typically fails to hold, 
and here, we find that the chain converges even more quickly to stationarity, typically within 
1.5p iterations. This observation cannot be fully explained by our theory. A simulation to 
follow using a correlated design shows that it is not a robust phenomenon: the chain can have 
poor mixing performance in this intermediate signal regime when the design is sufficiently 
correlated. 

In order to gain further insight into the algorithm’s performance, for each pair {X, V} we 
ran the Metropolis-Hastings random walk based on six initializations: the first three of them 
are random perturbations of the null model, whereas the remaining three are the true model. 
We made these choices of initialization because our empirical observations suggest that the 
null model and the true model tend to be near local modes of the posterior distribution. We 
run the Markov chain for 20p iterations and use the Gelman-Rubin (GR) scale factor [11] 
to detect whether the chains have reached stationarity. More precisely, we calculate the GR 
scale factor for the coefficient of determination summary statistics 

Y 

^7 = IIVII2 ’ I ^ {O’ 

IH II 2 

where ‘h.y denotes the projection matrix onto the span of {Xj,j G 7 }. Since the typical failing 
of convergence to stationarity is due to the multimodality of the posterior distribution, the 
GR scale factor can effectively detect the problem. If the chains fail to converge, then the 
GR scale factor will be much larger than 2 ; otherwise, the scale factor should be close to 1. 
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(a) (b) 


Figure 1. Log-posterior probability versus the number of iterations (divided by the num¬ 
ber of covariates p) of 100 randomly initialized Markov chains with n = 500, p = 1000 and 
SNR e {1,3} in the independent design. In all cases, each grey curve corresponds to one tra¬ 
jectory of the chain (100 chains in total). Half of the chains are initialized at perturbations of 
the null model and half the true model, (a) Weak signal case: SNR = 1. (b) Strong signal 
case: SNR = 3 (the posterior probability of the true model coincides with that of the highest 
probability model). 


Convergence of the chain within at most 20p iterations provides empirical confirmation of our 
theoretical prediction that the mixing time grows at most linearly in the covariate dimension 
p. (As will be seen in our empirical studies, the sample size n and sq have little impact on 
the mixing time, as long as sq remains small compared to n.) 

We report the percentage of simulated datasets for which the GR scale factor from six 
Markov chains is less than 1.5 (success). Moreover, to see whether the variable-selection 
procedure based on the posterior is consistent, we also compute the difference between the 
highest posterior probability found during the Markov chain iterations and the posterior 
probability of the true model (H-T) and the difference in posterior probabilities between 
the null model and the true model (N-T). If the true model receives the highest posterior 
probability, then H-T would be 0; if the null model receives the highest posterior probability, 
then N-T would be the same as H-T. 

Table shows the results for design matrices drawn from the independent ensemble. In 
this case, the Markov chain method has fast convergence in all settings (it converges within 
20p iterations). From the table, the setting SNR = 0.5 (respectively SNR > 2) corresponds 
to the weak (respectively strong) signal regime, while SNR = 1 is in the intermediate regime 
where neither the null model nor the true model receives the highest posterior probability. 
Table shows the results for design matrices drawn from the correlated ensemble. Now the 
Markov chain method exhibits poor convergence behavior in the intermediate regime SNR = 1 
with n = 500, but still has fast convergence in the weak and strong signal regimes. However, 
with larger sample size n = 1000, the Markov chain has fast convergence in all settings on p 
and SNR. Comparing the results under the two different designs, we find that correlations 
among the covariates increases the difficulty of variable-selection tasks when Markov chain 
methods are used. Moreover, the results under the correlated design suggest that there exists 
a regime, characterized by n, p and SNR, in which the Markov chain is slowly mixing. It 
would be interesting to see whether or not this regime characterizes some type of fundamental 
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(n,p) 


SNR = 0.5 

SNR = 1 

SNR = 2 

SNR = 3 


SP 

100 

100 

100 

100 

(500,1000) 

H-T 

113.4 

24.6 

0 

0 


N-T 

113.4 

11.4 

-210.9 

-383.6 


SP 

100 

100 

100 

100 

(500,5000) 

H-T 

148.7 

33.2 

0 

0 


N-T 

148.7 

17.4 

-216.6 

-395.9 


SP 

100 

100 

100 

100 

(1000,1000) 

H-T 

117.1 

34.8 

0 

0 


N-T 

117.1 

-6.9 

-342.4 

-649.5 


SP 

100 

100 

100 

100 

(1000,5000) 

H-T 

160.4 

32.8 

0 

0 


N-T 

160.4 

-4.2 

-377.6 

-743.4 


Table 1. Convergence behaviors of the Markov chain methods with sample sizes 
nG {500,1000}, ambient dimensions p£ {1000,5000}, and SNR G {0.5,1,2,3} in the inde¬ 
pendent design. SP: proportion of successful trials (in which GR< 1.5); H-T: log posterior 
probability difference between the highest probability model and the true model; N-T: log 
posterior probability difference between the null model and the true model. Each quantity is 
computed based on 20 simulated datasets. 


limit on computationally efficient procedures for variable selection. We leave this question 
open as a possible future direction. 

3.4.2 Bayesian methods versus the Lasso 

Our analysis reveals one possible benefit of a Bayesian approach as opposed to t'l-based 
approaches such as the Lasso. It is well known that the performance of the Lasso and related 
£i-relaxations depends critically on fairly restrictive incoherence conditions on the design 
matrix. Here we provide an example of an ensemble of linear regression problems for which 
the Lasso fails to perform correct variable selection whereas the Bayesian approach succeeds 
with high probability. 

For Lasso-based methods, the irrepresentable condition 

max m.ayiWX'^X^{X'E< 1 (12) 

|'y|=S* 

is both sufficient and necessary for variable-selection consistency pa SUES]. In our theory for 
the Bayesian approach, the analogous conditions are the upper bound in Assumption D(so) 
on the maximum model size, namely 


So > (2w2a;(A)-b l)s*, (13) 

as well as the sparse projection condition in Assumption B. Roughly speaking, the first con¬ 
dition is needed to ensure that saturated models, i.e., models with size sq, receive negligible 
posterior probability, such that if too many unimportant covariates are included the removal 
of some of them does not hurt the goodness of fit (see Lemma in the Appendix). This 
condition is weaker than the irrepresentable condition since we can always choose sq large 
enough so that so > (2i/“^a;(A) -|- l)s* holds, as long as Assumption B is not violated. 
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(n,p) 


SNR = 0.5 

SNR = 1 

SNR = 2 

SNR = 3 


SP 

100 

95 

80 

100 

(500,1000) 

H-T 

123.4 

75.2 

0 

0 


N-T 

123.4 

71.2 

-107.3 

-275.8 


SP 

100 

15 

100 

100 

(500,5000) 

H-T 

170.0 

81.0 

0 

0 


N-T 

170.0 

78.7 

-102.1 

-288.9 


SP 

100 

100 

100 

100 

(1000,1000) 

H-T 

138.7 

75.1 

0 

0 


N-T 

138.7 

-67.0 

-180.8 

-431.7 


SP 

100 

100 

100 

100 

(1000,5000) 

H-T 

161.8 

61.9 

0 

0 


N-T 

161.8 

-58.8 

-204.2 

-445.4 


Table 2. Convergence behavior of the Markov chain methods with sample size n € {500,1000}, 
ambient dimension p G {1000,5000}, and parameter SNR G {0.5,1, 2, 3} for the case of corre¬ 
lated design. SP: proportion of successful trials (in which GR < 1.5); H-T: log posterior 
probability difference between the highest probability model and the true model; N-T: log 
posterior probability difference between the null model and the true model. Each quantity is 
computed based on 20 simulated datasets. 


As an example, consider a design matrix X G that satisfies 


1 r 

-X^X = 


1 n 

H 1 0 
H 0 1 


0 

0 


G 


ji 0 0 . 1 


with /i = (2y/p)~^. (When p > n,we may consider instead a random design X where the rows 
of X are generated i.i.d. from the p-variate normal distribntion AA(0, Stad)-) This example 
was previously analyzed by Wainwright [M] , who shows that it is an interesting case in which 
there is a gap between the performance of £i-based variable-selection recovery and that of an 
optimal (but computationally intractable) method based on searching over all subsets. For 
a design matrix of this form, we have max|..y|=s*, W^k ^ so that the 

irrepresentable condition fails if s* > 2y^. Consequently, by known results on the necessity 
of the irrepresentable condition for Lasso HUES], it will fail in performing variable selection 
for this ensemble. 

On the other hand, for this example, it can be verihed that Assumption D(so) is satisfied 
with So > 13s*, and moreover, that the the RE(s) condition in Assumption B holds with 
iz = 1/2, whereas the sparse projection condition is satisfied with L = 16(1-|-Sq /r^) = 16-t- 
The only consequence for taking larger values of L is in the /Imin-condition: in particular, 
the threshold Cp is always lower bounded by Consequently, our theory shows that 

the Bayesian procedure will perform correct variable selection with high probability for this 
ensemble. 

To compare the performance of the Bayesian approach and the Lasso under this setup, 
we generate our design matrix from a Gaussian version of this ensemble; i.e., the rows 
of X are generated i.i.d. from the p-variate normal distribution A{(0, Sbad)- We choose 
{n,p,s*) = (300,80,20) so that s*p = 10/\/^ ss 1.1 > 1, i.e. the irrepresentable condition 
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BVS 


LASSO 


Figure 2. Boxplots indicating variable-selection performance of the Bayesian approach (BVS) 
and the Lasso. The boxplots are based on the logarithms of the ratio between the posterior 
probability of the selected model and the true model over 100 replicates. The model selected 
by the Bayesian approach is the median probability model [3] and the regularization parameter 
of the Lasso is chosen by cross-validation. 


fails. Figure shows the variable-selection performance for the Bayesian approach and the 
Lasso over 100 replicates. We report the logarithm of the ratio between the posterior proba¬ 
bility (see equation (|42|)) of the selected model and the true model, where we use the median 
probability model [3] as the selected model of the Bayesian approach. If a variable-selection 
approach has good performance, then we will expect this logarithm to be close to zero. Fig¬ 
ure shows that the Bayesian approach almost always selects the true model while the Lasso 
fails most of the time, which is consistent with the theory. 


4 Proofs 

We now turn to the proofs of our main results, beginning with the rapid mixing guarantee 
in Theorem which is the most involved technically. We then use some of the machinery 
developed in Theorem to prove the posterior consistency guarantee in Theorem Finally, 
by combining these two theorems we prove Corollary In order to promote readability, we 
defer the proofs of certain more technical results to the appendices. 


4.1 Proof of Theorem!^ 

For the purposes of this proof, let P denote the transition matrix of the original Metropolis-Hastings 
sampler ([^. Now consider instead the transition matrix P : = P/2-I-I/2, corresponding to 
a lazy random walk that has a probability of at least 1/2 in staying in its current state. By 
construction, the smallest eigenvalue of P will always be nonnegative, and as a consequence, 
the mixing time of the Markov chain C is completely determined by the second largest eigen¬ 
value A 2 of P. The difference Gap(P) : = I — A 2 is known as the spectral gap, and for any 
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lazy Markov chain, we have the sandwich relation 


1 (1 - Gap(P)) 

2 Gap(P) 


log [l/(2e)] 


< 


Te < 


(log [1/ mm TT ( 7 )] +log(l/e)) 
Gap(P) 


(14) 


See the papers [301137! for bounds of this form. 

Using this sandwich relation, we claim that it suffices to show that there are universal 
constants (ci,C 2 ) such that with probability at least 1 — the spectral gap of the lazy 

transition matrix P is lower bounded as 


Gap(P) > 

p4 


(15) 


To establish the sufficiency of this intermediate claim, we apply Theorem and make use 
of the expression (42) for the posterior distribution, thereby obtaining that for 7 G the 
posterior probability is lower bounded as 


7r„(7 I Y) = 'Knil* I Y) 


I Y) 


-Knil* I Y) 

> .p-(l+"/2)so , p-anl2 


(l + g(l-7?^0)"/^ 

(l + 5(l-i?2))”/2 


with probability at least 1 — Ap . Combining the above two displays with the sandwich 


relation (14), we obtain that there exist constants (c'^,C 2 ) such that for e G (0,1), 

Te < c'l psl (c'aa (n + sq) logp + log(l/e) + 2 ) 
with probability at least 1 — . 


Accordingly, the remainder of our proof is devoted to establishing the spectral gap bound (15), 


and we do so via a version of the canonical path argument m- Let us begin by describing 
the idea of a canonical path ensemble associated with a Markov chain. Given a Markov chain 
C with state space consider the weighted directed graph G{C) = {V, E) with vertex set 
V = ^ and edge set E in which an ordered pair e = ( 7 , 7 O is included as an edge with 
weight Q(e) = Q(7,70 = ^(7)P(7)70 if only if P( 7 , 7 ') > 0 . A canonical path ensemble 
T for C is a collection of paths that contains, for each ordered pair ( 7 , 7 ') of distinct vertices, 
a unique simple path in the graph that connects 7 and 7 b We refer to any path in the 
ensemble 7~ as a canonical path. 

In terms of this notation, Sinclair m shows that for any reversible Markov chain and any 
choice of canonical path T, the spectral gap of P is lower bounded as 


Gap(P) > 


1 


1 —A 2 


pimry 


(16) 


where iiT) corresponds to the length of a longest path in the ensemble T, and the quantity 

p{T) : = max ^ 7 r( 7 ) 7 r( 7 ') is known as the path congestion parameter. 

eeE 

In order to apply this approach to our problem, we need to construct a suitable canonical 
path ensemble T. To begin with, let us introduce some notation for operations on simple 
paths. For two given paths Ti and T 2 : 
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• Their intersection Ti nT 2 corresponds to the subset of overlapping edges. (For instance, 
if Ti = ( 1 , 1 , 1 ) ^ ( 0 , 1 , 1 ) ^ ( 0 , 0 , 1 ) ^ ( 0 , 0 , 0 ) and r 2 = ( 0 , 0 , 1 ) ^ ( 0 , 0 , 0 ), then 
rinr 2 = ( 0 , 0 , 1 )^( 0 , 0 , 0 ).) 

• If r 2 C Ti, then Ti \ T 2 denotes the path obtained by removing all edges in T 2 from 
Ti. (With the same specific choices of (Ti,T 2 ) as above, we have Ti\T 2 = (1,1,1) —)■ 
( 0 , 1 , 1 ) ^( 0 , 0 , 1 ).) 

• We use Ti to denote the reverse of Ti. (With the choice of Ti as above, we have 

Ti = (0, 0, 0) ^ (0,0,1) ^ (0,1,1) ^ (1,1,1).) 

• If the endpoint of Ti and the starting point of T 2 are the same, then we define the union 
Ti U T 2 as the path that connects Ti and T 2 together. (If Ti = (0,0,0) —>■ (0,0,1) and 
T 2 = (0,0,1) —)• (0,1,1), then their union is given by Ti U T 2 = (0, 0, 0) —>■ (0,0,1) —)• 
( 0 , 1 , 1 ).) 

We now turn to the construction of our canonical path ensemble. At a high level, our 
construction is inspired by the variable-selection paths carved out by greedy stepwise variable- 
selection procedures (e.g., @01 [2]). 

Canonical path ensemble construction for First, we construct the canonical path 

from any 7 G ^ to the true model 7 *. The following construction will prove helpful. 
We call a set TZ of canonical paths memoryless with respect to the central state 7 * if: (1) for 
any state 7 G ^ satisfying 7 / 7 *, there exists a unique simple path Ty, 7 * in TZ that connects 
7 and 7 *; (2) for any intermediate state 7 G ^ on any path in TZ, the unique path 

T 7 7 * in TZ that connects 7 and 7 * is the sub-path of starting from 7 and ending at 7 *. 
Intuitively, this memoryless property means that for any intermediate state on any canonical 
path towards the central state, the next move from this intermediate state towards the central 
state does not depend on the history. A memoryless canonical path ensemble has the property 
that in order to specify the canonical path connecting any state 7 G ^ and the central state 
7 *, we only need to specify which state to move to from any 7 7 ^ 7 * in ./#; i.e., we need a 
transition function Q : ^ \ { 7 *} —)• ^ that maps the current state 7 G ^ to a next state 
Qici) £ ■ For simplicity, we dehne ^( 7 *) = 7 * to make ^ as the domain of Q. Clearly, 

each memoryless canonical path ensemble with respect to a central state 7 * corresponds to 
a transition function Q with G{'y*) = 7 *, but the converse is not true. For example, if there 
exist two states 7 and 7 ' so that ^( 7 ) = 7 ' and ^( 7 ') = 7 , then Q is not the transition 
function corresponding to any memoryless canonical path ensemble. However, every valid 
transition function Q gives rise to a unique memoryless canonical path set consisting of paths 
connecting any 7 G ^ to 7 *, with 7 * corresponding to the fixed point of Q. We call function 
0 a valid transition function if there exists a memoryless canonical path set for which Q is 
the corresponding transition function. The next lemma provides a suffcient condition for a 
function Q : ^ \ { 7 *} —?• ^ to be valid, which motivates our construction to follow. Recall 
that du denotes the Hamming metric between a pair of binary strings. 

Lemma 1. // a function Q : ^ \ { 7 *} —)• ^ satisfies that for any state 7 G ^ \ 7 *, the 
Hamming distance between 0(j) and 7 * is strictly less than the Hamming distance between 7 
and 7 *, then Q is a valid transition function. 

Proof. Based on this function Q, we can construct the canonical path Ty,-)-* from any state 
7 G to 7 * by defining as 7 —)• Gil) ( 7 ) G^'^ ( 7 ), where G^ := G o .. - oG 
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denotes the /c-fold self-composition of Q for any /c G N and : = minfc{^^( 7 ) = 7 *}. In order 
to show that the set {T-y^-y* : 7 G ^,7 7 ^ 7 *} is a memoryless canonical path set, we only 
need to verify two things: 

(a) for any 7 7 ^ 7 *, Ty^y* is a well-defined path; i.e., it has finite length and ends at 7 *, and 

(b) for any 7 7 ^ 7 *, Ty^y* is a simple path. 

By our assumption, the function F : ^ —)■ M defined by ^( 7 ) = dni'y, 7 *) is strictly decreasing 
along the path Ty^y* for 7 7 ^ 7 *. Because F only attains a finite number of values, there exists 
a smallest ky such that ^^^^( 7 ) = Q^{^) for each k > ky, implying that ^^''( 7 ) is a fixed point 
of Q. Since 7 * is the unique hxed point of Q, we must have ^^'>'( 7 ) = 7 *, which proves the first 
claim. The second claim is obvious since the function F dehned above is strictly decreasing 
along the path Ty^y*, which means that the states on the path Ty^y* are all distinct. □ 

Equiped with this lemma, we start constructing a memoryless set of canonical paths from 
any state 7 G ^ to 7 * by specifying a valid G function. First, we introduce some definitions 
on the states. A state 7 7 ^ 7 * is called saturated if I 7 I = sq and unsaturated if I 7 I < sq- We call 
a state 7 7 ^ 7 * overfitted if it contains all influential covariates, i.e. 7 * C 7 , and underfitted if 
it does not contain at least one influential covariate. Recall the two updating schemes in our 
Metropolis-Hastings (MH) sampler: single flip and double flips. We accordingly construct the 
transition function G as follows. 

(i) If 7 7 ^ 7 * is overfitted, then we define ^( 7 ) to be 7 ', which is formed by deleting the 
least influential covariate from 7 , i.e. 7 ' = 7 j for any j 7 ^ iy and 7 ^ = 0 , where £y is 
the index from the set 7 \ 7 * of uninfluential covariates that minimizes the difference 

\\%Xy* 13*^41 - \\<^y\{g}Xy. 13* 4l 

where <I>.y denotes the projection onto the span of {Xj,j G 7 }. This transition remsem- 
bles the backward deletion step in the stepwise variable-selection procedure and involves 
the single flip updating scheme of the MH algorithm. By construction, if 7 7 ^ 7 * is over¬ 
fitted, then dH{G{j),'y*) = dniG4)^1*) - 

(ii) If 7 7 ^ 7 * is underfitted and unsaturated, then we dehne ^( 7 ) to be 7 ', which is formed 
by adding the influential covariate from 7 * \7 that explains the most signal variation, i.e. 
7 ' = 7 j for any j 4 J 7 and 7 '^ = 1, where jy is dehned as the j G 7 * \ 7 that maximizes 
the quantity W^yyjf^j'^Xy *This transition remsembles the forward selection step 
in the stepwise variable selection procedure and involves the single hip updating scheme 
of the MH algorithm. By construction, if 7 7 ^ 7 * is underhtted and unsaturated, then 
dniGil),!*) = dH{Gij),j*) - 1 . 

(hi) If 7 7 ^ 7 * is underhtted and saturated, then we dehne ^( 7 ) to be 7 ', which is formed 
by replacing the least inhuential unimportant covariate in 7 with the most inhuential 
covariate from 7 * \ 7 , i.e. 7 ' = 7 j for any j 0 {jy, ky}, 7 '^ = 1 and 7 ^^ = 0 , where jy 

is dehned in case 2 and ky G 'y\'y* minimizes W^yyj^j^Xy*~ ll‘h 7 u{i}\{fc}^ 7 */ 37 * 111 - 
This transition step involves the double-hip updating scheme of the MH algorithm. By 
construction, if 7 7 ^ 7 * is underhtted and saturated, then dj 7 (^( 7 ), 7 *) = d/f ( 7 ), 7 *) — 

2 . 
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Figure 3. Illustration of the construction of the canonical path ensemble. In the plot, 7 * is 
the central state, Q is the transition function and solid blue arrows indicate canonical paths 
1 / iy 2 ,^ * and '-L ■ 


By Lemma this transition function Q is valid and gives rise to a unique memoryless set of 
canonical paths from any state 7 G ^ to 7*. For example, Figj^shows such a memoryless set 
of canonical paths for ^ consisting of 14 states, where corresponds to the canonical 

path from state 72 to the central state 7*. 

Based on this memoryless canonical path set, we can finish constructing the canonical 
path ensemble T by specifying the path connecting any distinct pair ( 7 , 7 ^ £ x 
More specifically, by the memory less property, the two simple paths Tly, 7 * and share an 

identical subpath towards 7 * from their first common intermediate state. Let T^n'y' denote 
this common subpath n Ty^^*, and T^\y ■ = Fy^y* \ Fypy/ denote the remaining path of 
Fy^y* after removing the segment Fyny'. We define Ty\^ in a similar way as Fy/^y* \ Fyny'. 
Then it is easy to see that the two remaining paths Fy\y/ and Fy/\y share the same endpoint. 
Therefore, it is valid to define the path T^^y as T^\yUTy\^. To understand this construction, 
let us consider an example where Fy^y* = (0, 1 , 1 ,1 ) —)• (1,1,0,1) —)• (1,1,0,0) and Fy/ y* = 
(1,0,0,1) —>■ (1,1,0,1) —>■ (1,1,0,0). Their intersection is Fypy/ = (1,1,0,1) —)> (1,1, 0,0) 
and the two remaining paths are T^\y = (0,1,1,1) —)■ (1,1,0,1) and Ty\^ = (1,0,0, 1 ) —)■ 
(1,1,0,1). Consequently, the path Fy y/ from 7 to 7 ' is (0,1,1,1) —)■ (1,1,0,1) —)■ (1,0,0,1) 
by our construction. For example, path Fyg^y^ in Fig [^illustrates the construction of the path 
connecting ( 73 , 74 ) when ^ is composed of 14 states. 

We call 7 a precedent of 7' if 7' is on the canonical path Fy^y* G T, and a pair of states 
7,7' adjacent if the canonical path T^^y is e^^y, the edge in E connecting 7 and 7'. For 
7 G let 


A( 7 ) : = {7 I 7 e Fy,y.} (17) 

denote the set of all its precedents. Use the notation |F| to denote the length of a path F. 
The following lemma provides some important properties of the contracted canonical path 
ensemble that will be used later. 
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Lemma 2 . For any distinct pair (7,7') G ^ x 
(a) We have 


1 ^ 7 , 7 *! < So, and (18a) 

<d//(7,7*) + c^//(7',7*) < 2 so. (18b) 

(b) If 'y and 7' are adjacent (joined by edge e^^y) and y is a precedent of y', then 

{(7,7 ) I ^ ^ 7 , 7 '} ^ ^(7) X 

Proof. The first claim follows since the function F : .M —)• M defined by F{y{) = dH{'y,'y*) is 
strictly decreasing along the path T^, 7 * for 7 7 ^ 7 *. Now we prove the second claim. For any 
pair (7,^) such that 3 e^^y, either e^^y G T^\y or ey^^ G Ty\^?^ should be satished since 
= Tyy U Tyy by our construction. Because 7 is a precedent of 7', we can only have 
e.yy G Tyy. This shows that 7 is on the path and 7 G A(7). □ 

According to Lemma [^(b), the path congestion parameter p{T) of the canonical path T 
satisfies 


p{T) < max 7 r( 7 ) 7 r( 7 ') = max 

(7,7')6r* Q(7,7') (7,7')6r* Q(7,7') 


(19) 


where the maximum is taken over the set 

r* := I (7,70 £ X I = ey,^y and 7 G A(7')|. 

Here we used the fact that the weight function Q of a reversible chain satishes Q( 7 , y') = Q(7^ 7 ) 
so as to be able to restrict the range of the maximum to pairs (7,70 where 7 G I^{y'). 

For the lazy form of the Metropolis-Hastings walk given any pair ( 7,70 such that 
P(7)70 > 0) ^6 have 


Q(7,70 = I ^)P(7,70 

^ 1 / I • ri I T)y 

> - 7 r„( 7 | y)mm{l,———I = 


2 pso 


7r„(7 I y) 2pso 


min{7r„(7' | Y), 7r„(7 | T)}. 


Substituting this lower bound into our upper bound (19) on the path congestion parameter 
yields 


p{T) <2pso max 


7rn[A(7) I y] 


= 2p So max 

(7,7')er^ 


(7,7')6r* min{7r„(7 | y),7r„(Y | y)} 

I I TTnll Y) / TTrilT' yl / 


( 20 ) 


In order to prove that p{T) = O{pso) with high probability, it suffices to show that the two 
terms inside the maximum are 0(1) with high probability. In order to do so, we make use of 
two auxiliary lemmas. 
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Given the constant Cq > 4 and the noise vector w ~ AA(0, consider the following 

events 


j\.n • — 

Bn.= { 
Cn : = { 


max 


{7i.72)6^x^ I 71 I — I 72 
72 C 71 






max • 

t 7g^ 

Irclh 


ItI 


- 1 


< rcjQ 


logpj, 


and 


nUn 


< - 
“ 2 




and 


r 


9 


( 21 a) 

( 21 b) 

( 21 c) 


Our first auxiliary lemma guarantees that, under the stated assumptions of our theorem, 
the intersection of these events holds with high probability: 


Lemma 3. Under the conditions of Theorem^ we have 

P(A n fin n Cn n 2?„) > 1 - 6p- 

We prove this lemma in Section |4.2| to follow. 


( 22 ) 


Our second auxiliary lemma ensures that when these four events hold, then the two terms 
on the right-hand side of the upper bound ( 20 ) are controlled. 

Lemma 4. Suppose that, in addition to the conditions of Theorem the compound event 
An O ^ holds• T^hGTh j'OT' (ill 'y 'y j IDG hdDG 


0-2 


TTnil I y) 

T^n{Q{l) I y) ~ \p~^ 


< 


if j is overfitted, 
if j is underfitted, 


and moreover, for all 

7rn[A(7) I Y] 


< c 


I y) 

We prove this lemma in Section |4.3| to follow. 


for some universal constant c. 


(23a) 


(23b) 


Combining Lemmasandwith our earlier bound (20), we conclude that p{T) < 2cpsQ. By 
Lemma [^(a), our path ensemble Y has maximal length ^(T) < 2so, and hence the canonical 
path lower bound ([ 
completes the proof of the theorem. 


implies that Gap(P) > ^^^^ 2 ; as claimed in inequality (15). This 


The only remaining detail is to prove Lemmas and and we do so in the following two 
subsections. 

4.2 Proof of Lemma | 3 ] 

We split the proof up into separate parts, one for each of the events An,Bn,Cn and Vn- 

Bound on P[Cn]: Since H'lalll/o'o ~ Tni a standard tail bound for the 7 ^ distribution 
(e.g., m, Lemma 1) yields 

P[Cn] > 1 - 2e"i. (24) 
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Bound on P[Bri]: For each state 7 G the random variable w'^^^w/aQ follows a chi- 
squared distribution with I 7 I degrees of freedom. For each integer G {1,..., so}, the model 
space ^ contains (^) models of size £. Therefore, by a union bound, we find that 

Jo / \ ■^o 

n^n] > 1 - > Conogp) > 1 _ 

£=1 ^ 1=1 

= 1 - (25) 


Bound on 


Given the linear observation model, we have 
||F||2 = \\X/3* +w\\l < 2\\xp*f + 2\\w\\l 


Combining this with inequality (24), we obtain 


|y||^ > 2||X/3*||i+ 3ncT^] <2e"i <p-^o{r/4-i) 


for large n and some constant C > 0, where we have used Assumption D. By Assumptions A 
and D, we have ||X/ 3*||2 < 2naQgjs*, implying that 

F[V^] < F[\\Y\\l > ^\\XP*\\l + 3 nug] < (25) 


Bound on P[An]: To control this probability, we require two auxiliary lemmas. 

Lemma 5 . Under Assumption B, for any distinct pair (7,7) G ^ x ^ satisfying 'y <Z^, we 
have 


At, 




n 


> V. 


Proof. By partitioning the matrix X^ into a block form {X.y, and using the formula for 

the inverse of block matrices, one can show that the lower right corner of {n~^Xf^X?y^ ^ is 
{n~^XT^^{In — which implies the claimed bound. □ 

Lemma 6. For 7 G and /c ^ 7, we have 

’ XpI-1>,)X, ' 

Proof. By the block matrix inversion formula m, we have 

_ \B + aBX'^^- 

m'X, X'iXkl 

\-i 


B + aBXf^XkXlXB -aBXi^Xu 

—aX'^ X.yB a 


where B = (A^ X.y/ 
yields 

$ 


and a = {X^ (I — ^.y)Xk) ^ G M. Then simple linear algebra 


7 U{fc} 


'x^x^ xTXk 
VHX, xtx, 

= a(I-<k^)AfcAj(/-ck^), 


- = [A., Afc] 


1 -1 

[A^l 


J 




which is the claimed decomposition. 


□ 
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Returning to our main task, let us define the event 


S.t. fc^7 

By construction, we have A'^ C An so that it suffices to lower bound Lemma [^implies 

that 




|((I - <^j)Xk, w)\^/n 
Xl{l-^^)Xk/n 


(27) 


Now we show that with probability at least 1 the above quantity is uniformly bounded 

by LcJq logp over all ( 7 , k) G M x { 1 ,... ,p} satisfying I 7 I < sq and fc ^ 7 , which yields the 
intermediate result 


p(x) > > 1 - p- 


(28) 


Now Lemma I implies that ^X'^ (/ — > V, and therefore, if we define the random 

variable 

V{Z):= max ^\{{l - ^AXk, Z)\, where Z ~ iV(0, 4), 

76^, yU 

s.t. k^'y 

then it suffices to show that V{Z) < ^Lu log p with probability at least 1 —p~^. For any two 
vectors Z, Z' G M"", we have 

\V{Z) - V(Z')I < max ^j)Xk, Z - Z')\ 

76.^,fce{i,...,p} yn. 
s.t. A:^7 

<^\\[l-^^)Xk\\2\\Z-Z'\\2 < \\Z-Z'\\2, 


where we have used the normalization condition of Assumption B in the last inequality. Con¬ 
sequently, by concentration of measure for Lipschitz functions of Gaussian random variables 
[ 20 ] . we have 


P[R(Z) > E[F(Z)]+t] < e"V. (29) 

By the sparse projection condition in Assumption B, the expectation satishes E[1/(Z)] < 
\/ Lv log p /2, which combined with (29) yields the claimed bound (28) with c < Lu/^. 


4.3 Proof of Lemma 0 ] 

We defer the proof of the claim (23a) to Appendix [B] as it is somewhat technically involved. 
It is worth mentioning that its proof uses some auxiliary results in Lemma in Appendix |B] 
which characterizes some key properties of the state ^( 7 ) selected by the transition function 
Q via the greedy criterion. 

It remains to prove the second bound (23b) in Lemma and we split our analysis into 
two cases, depending on whether 7 is underfitted or overhtted. 
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4 . 3.1 Case 7 is underfitted 


In this case, the bound ( |23a ) implies that — P each 7 G A( 7 ), where A( 7 ) is 

defined in Lemma [^e), we know 7 G C T^,'y*- Let the path be 70 —)■ 71 7 ^, 

where s = is the length of and 70 = 7 and 7 s = 7 are the two endpoints. Since 

any intermediate state 7 on path is also underfitted, inequality (23a) ensures that 


I ^ A Mle-l I ^ 3 s ^ „- 3 |T,.,| 
TTnhlY) l\ MlllY) ^ 


Now for each s G {0,..., s*}, we count the total number of states 7 in A( 7 ) that satisfies 
\T^,-y\ = s. By construction, at each intermediate state in a canonical path, we either add a 
new influential covariate by the single flip updating scheme of the MH algorithm, or add a new 
influential covariate and delete an unimportant covariate by the double-flip updating scheme. 
As a consequence, any state in ^ has at most (s* + l)p adjacent precedents, imlying that the 
total number of states 7 in A( 7 ) with path length = s is upper bounded by (s* -|- 1)^p^. 
Consequently, we have by the preceding display that under the event An n n n Vn 


7r^[A(7)|y] 

I Y) 


E 

76/(7) 


T^nil I Y) 

'^nillY) 


S* CO 

{s* + iyp-^^<Y.p-^< 

s=0 s=0 


1 

1 - i/p' 


The above argument is also valid for 7 = 7 *. 


(30) 


4 . 3.2 Case 7 is overfitted 

In this case, we bound the ratio by dividing the set A( 7 ) into two subsets: 

(a) Overfitted models: Mi = { 7 ' G A( 7 ) : 7 ' D 7 *}, all models in A( 7 ) that include all 
influential covariates. 


(b) Underfitted models: M 2 = { 7 ' G A( 7 ) : 7 ' 7 ) 7 *}, all models in A( 7 ) that miss at least 
one influential covariate. 


First, we consider the ratio 7 rn(Mi | Y)/'Kn{n I Y). For each model 7 G Mi, according to our 
construction of the canonical path, all intermediate states on path = 7 o 71 7 ^ 

correspond to overfitted models (only involve the first flipping updating scheme of the MH 
algorithm), where endpoints 70 = 7 and 7 ^ = 7 , and k denotes the length of path ^^, 7 . As a 
consequence, inequality (23a) implies that 


^n(7 I Y) ^ -A 7rn(7^-i I E) ^ 2 fc 
I Y) 7r„(7^ I Y) 


Since there are at most p^ states 7 in Mi satisfying I 7 I — I 7 I = k, we obtain that under the 
event An n n H 


7rn(Mi I F) 
'^nil I Y) 


P-I7I 00 

< - 
k=0 k=0 


1 

1 - i/p 


< 2 . 


(31) 


Second, we consider the ratio '/r„(M 2 | Y)/iTn{'y \ Y). For fixed 7 G M 2 , let 7 ( 7 ) be the first 
state along the path that contains all influential covariates. Since the overfitted state 7 
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contains all influential covariates, f{'y) exists and is well-defined. Moreover, this construction 
ensure that f{j) G Mi and 7 C A{f{j)) \ {/(7)}. Applying inequality (|23a) then yields 


7rn(M2 I Y) 

I y) 


E 

7eM2 


7rn(7 I y) 

I y) 


E 

7eM2 


7i'n(/(7) I y) 7r„(7 I y) 


I y) \ E) 


< 


E 


I y) 


E 


I y) 


I y) I E) 

3 7eM2 ^ 7eA(7)\{7} ^ 

such that7=/(7) 


E 


'^n(7 I E) /7r„[A(7) I E] 


3 '7EM2 

such that 7 =/( 7 ) 


7r„(7 I Y) \ 7r„(7 | Y) 


- 1 


Then by treating 7 = 7 ( 7 ) G Mi as the 7 in inequality (30) and inequality (31), we obtain 
that under the event An CiBn^CnA Vn 


7rn(M2 I E) 
I E) 


< 


E 


3 • 7 GM 2 

S.t. 7 =/( 7 ) 


T^nil I E) 
^n(7 I E) 


I 1 — 1 /p / 


^ 2 T^n{l I E) 
“ p E—' TTnil I E) 

^ yeMi ^ 


(32) 


2 TTri 




E) 


p TTnil I E) 


4 

< 


Combining inequality (31) and inequality (32), we obtain that that under the event AnA\BnA\ 


Cn n Vn, the posterior ratio is upper bounded as 

vrn[A( 7 ) I E] _ 7r„(Mi | E) 7r^(M2 | E) 
7r„(7 I Y) 'Knil I E) 7r„(7 | Y) 


< 6 . 


(33) 


The above argument is also valid for 7 = 7*, and this completes the proof of inequality (23b). 


4.4 Proof of Theorem [T] 

We divide the analysis into two steps. In the first step, we show that the total posterior 
probability assigned to models with size 0{s*) other than 7* is small. In the second step, we 
use the fact that all large models receive small prior probabilities to show that the remaining 
models should also receive small posterior probability. 


Step 1: Let M5 : = {7 G {0,1}^ : I7I < Ks*, 7 / 7*} denote the set of all models with 
moderate sizes, where K > 1 some constant to be determined in step 2. Consider the quantity 


7 rn(Mg I Y) 

I E) 


E 

7 GM 5 


7rn(7 I E) 
7r„(7* I E)' 


(34) 


Similar to Lemma we modify the definition of the four events An, Bn, Cn and Vn by 
replacing ^ with Mg. Following the proof of Lemma it is straightforward to show that 
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these four events satisfy 


AnnBnnCnn Vr, 


> 1 — 6p 


( 35 ) 


The following auxiliary lemma ensures that when these four events hold, then the posterior 
ratios on the right hand side of equation ( 34 ) are well controlled. 


Lemma 7. Under Assumptions A-D and under the event An H Bn U Cn U 'Dn, the posterior 
ratio of any 'y (A 1*) in Mg is bounded as 


TTnij I T) ^ jp if'j is overfitted, 

7r„(7* I Y) ~ if ^ is underfitted. 


We prove this lemma in Appendix [C} 

Equipped with this lemma, a simple counting argument yields that under the event An H 
BnHCnnVn, 


I V) « V„V“ 

7r„(7* I Y) ^ 


+ ^ < 3 p \ 


k=i e=o 

where in step (i), we used the fact that there are at most p^ overfitted models 7 with |7\7*| = k 


and at most underfitted models 7 with I7I = i. Combining this with inequality ( 35 ), we 
obtain that with probability at least 1 — Qp~^, 


7r„(Ms I y) < 3p 7r„(7* | T) < 3p h 


(36) 


Step 2: Let M^, : = {7 G {0,1}^* : I7I > Ks* + 1} denote the set of large models. By Bayes’ 
theorem, we can express the posterior probability of Mg as 


7r„(Mg I y) = 




E 


7 e{o,i}p h,<i) 




{Y)'Kn{de,d 4 ),'y) 


(37) 


where and Pq stand for probability distribution of Y under parameters (/?, 4 >, 7) and 

the true data generating model, respectively. We bound the numerator and denominator 
separately. 


First consider the numerator. According to our specification of the sparsity prior ( 5 d) for 
the binary indicator vector 7, the prior probability of Mg satisfies 


-Ks*-1 


7rn(Mg) = Mi)<p 

7; |7|>i<'s*+l 

By Fubini’s theorem we have the following bound for the expectation of the numerator: 




'[y 

7eM£ 


, 4 >,i 




dPo 


{Y)TTn{d0,d(f,j) 




(y) 7r„(d6',#,7) 


E / 

7 eMi 

[ Trnid6,d4>,'y) = 7r„(Mg) < p 


-Ks*-l 


7 eML 


' 9 , 4 > 
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where we have used the fact that Eq (y)] = 1. Therefore, by applying Markov’s 

inequality we have 


IPo 




{Y)-Kn{d6,d(p,'y) < p ^ 


>l-p-Ks*/2_ 


(38) 


By the expression (41) of the marginal likelihood function, we can bound the denominator 
from below by 



dFo 


{Y) 7r„((i6»,d(/),7*) 


£^(y|7*)7r„(7*) 

dPo(y) 


r(f)(l+5)"/2 (1 + 5)-"*/' cp-2** 

W2 (||y||2 + ^||(J_cI>^*)^||2)n/2 ' dPo(y)’ 


where w = w + Xs<^jd*sc ~ M{Xs<=/3gc,(7Q). Under the true data-generating model Eoj 
the density for Y is (T(^”(27r)“”/2gxp{—(2cTQ)“^||t(;||2}. By applying the the lower bound 
r(re/2) > (27r)^/^ (n/2 — i^+ 2 -i/ 2 g-n/ 2 +i using the fact that the projection operator 
I — is non-expansive, we obtain 


[ ^^^(y)vr.(d0,#,7*) >cp-2^*(l + 5)-"*/'(l + 5-')"/' 

.exp{( 27 )-‘(||w||i- ||u.||i - ||F||i/9)} 

V _ ^ V _ ^ 

/(“) !//(+ 

where u = o'(j'^(||t (}||2 + ||y|||/ 5 ). Since g~^ < n~^ and the function f{u) = i(-+ 2 gM /2 attains 
its minimum at u = n, we further obtain 




19,4) 


{Y)TTn{de,d(t),-f*) > cp ^*’(1 + 5) exp{(2cro) ^{\\w\\l - \\w\\l - \\Y\\l/g)}, 


with a different universal constant c. 

The off-support condition in Assumption A and the high probability bound for the 
event Cn H in Lemma imply that the last exponential term is of order p~^^^ for some 
universal constant d with probability at least 1 — p~^'^. Therefore, for LC > 4 + a + ‘lc\L, we 
have 


[ (^) T^nidd, d(j), 7 *) > cp 

Je,4> h+o 


Combining equations (37), (38) and (39), we obtain that 

7r„(ML I Y) < cp~^ 

holds with probability at least 1 — 2p“^^ 

Finally, inequalities 


and (40) in steps 1 and 2 together yield that 
7r„(7* I y) = 1 - 7r„(M5 | y) - 7r„(Mi | y) > 1 - c^p-\ 


(39) 


(40) 


holds with probability at least 1 — 8p ^ , which completes the proof. 
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4.5 Proof of Corollary 

Let Pi denote the probability distribution of iterate 7 * in the MCMC algorithm. According to 
the definition of e-mixing time, for any t > T^/p, we are guaranteed that |Pi( 7 *) — 7r„(7*)| < 

By Theorem the posterior probability of 7 * satishes 7rn(7*) > 1 — with probability 

at least 1 — C 2 P~'^^- By Theorem the p“^-mixing time Ti/p satishes 

Ti/p < Cl psl {c 2 a (n + So) logp + logp + 2 ) 

with probability at least 1 — Combining the three preceding displays, we hnd that 

1 * 1 ( 7 *) > 1 — (ci + as claimed. 


5 Discussion 

In this paper, we studied the computational complexity of MCMC methods for high-dimensional 
Bayesian linear regression under a sparsity constraint. We show that under a set of conditions 
that guarantees Bayesian variable-selection consistency, the corresponding MCMC algorithm 
achieves rapid mixing. Our result on the computational complexity of Bayesian variable- 
selection example provides insight into the dynamics of the Markov chain methods applied to 
statistical models with good asymptotic properties. It suggests that contraction properties of 
the posterior distribution are useful not only in guaranteeing desirable statistical properties 
such as parameter estimation or model selection consistency, but they also have algorithmic 
benefits in certifying the rapid mixing of the Markov chain methods designed to draw samples 
from the posterior. 

As a future direction, it is interesting to investigate the mixing behavior of the MCMC 
algorithm when Bayesian variable selection fails. For example, slow mixing behavior is ob¬ 
served empirically in the intermediate SNR regime in our simulated example and it would be 
interesting to understand this result theoretically. Another interesting direction is to consider 
the computational complexity of MCMC methods for models more complex than linear re¬ 
gression, for example, high-dimensional nonparametric additive regression. A third direction 
is to investigate whether the upper bound on mixing time provided in Theorem is sharp up 
to constants. 
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A Further details on Metropolis-Hastings 

In Appendix [A] we show that under the specified model, the maximum a posteriori solution 
(MAP) of the Bayesian variable-selection problem is equivalent to the following optimizatio 
problem with 


7 = arg mm 

|7|<so 


' n , 


^ +5'(l-l|y||2 ) 
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where X ^)~^is the projection onto the column space of X^, and the regular¬ 

ization parameter A : = ^ log(l + g) + Klogp. Here the penalty Ajyl comes from two sources: 
the penalty nlogp lyj on 7 and the Occam’s razor penalty ^ log(l+S') I 7 I due to the integration 
over the model parameter (3^. Therefore, choosing an appropriate hyperparameter k in the 
Bayesian approach is equivalent to choosing a corresponding regularization parameter A in 
the penalization method: a small k could make the posterior include uninfluential covariates 
due to noise; a large k requires the signal-to-noise ratio (3j/cr of influential covariates to be 
large enough so that they can be selected out by the posterior. 


A.l Connections to MAP estimates 

Under the Bayesian model specihed by equation ([^, we can obtain a closed-form expression 
for the marginal likelihood of the indicator vector 7 by integrating out and 4>'. 


Cn{Y\ 7 ) : = 'Kn{Y\ 7 ) = y vrn(d/3, d(j) I 7 ) 

r(§)(i + <7r/2 (i + 5)-h/2 


W 2 ||y||" {I + g(l - m))n/2- 


(41) 


where r(-) the Gamma function, IP/?,,/),7 is the distribution of Y under parameters (/?, i?i, 7 ), 
and is the coefficient of determination for the model M.y 

Y^^,Y 

-^7 IIVl|2 ’ 

lU II 2 


with = X^{X'^X^)~^X'^ the projection onto the column space of When there is 
no confusion, we identify the variable inclusion vector 7 and the linear model M..,, associated 
with it. Let ^ : = {7 : I 7 I < so} denote the entire model space, which is a subset of 
the p-dimensional hypercube {0,1}^ under our identification. Then, by Bayes’ theorem the 
posterior probability of 7 is given by 


vrn(7 \Y) = C ■ 


1 

pi^b\ 


(l + g(l-i22))n/2 


I[7 G 


(42) 


where C is a normalization constant. 

According to the preceeding display, the maximum a posteriori (MAP) solution of the 
Bayesian variable-selection problem is equivalent to the following penalized optimization prob¬ 
lem fo-penalty 


7 = arg mm 


| 7 |<so I 2 




^ + 5(1 -||y||2 ) 


+ 


A|7l}, 


where the regularization parameter A : = ^ log(l -\- g) + logp. Conversely, if we have a variable- 
selection procedure based on the penalization method with £o-penalty 


7 = arg min {/(y, 7 )A| 7 |}, 
7 ; |7|<so 


where /(y, 7 ) is some function reflecting the goodness of fit by using model M..),, then we can 
construct a pseudo-posterior distribution 

T^n{l\Y) = C ■ l[|7| < so] 
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with C is a normalization constant, and conduct Bayesian inference based on 7 r„. For example, 
when f{Y,'y) is the negative profile log-likelihood ^ log (^||(/ —<h^)y|| 2 ), where the regression 
coefficient [3^ and the precision parameter 4> have been profiled out from the log-likelihood 
given 7 by maximization, then the choice of A = 1 corresponds to the Akaike information 
criterion, or AIC for short [I]), A = logn /2 the Bayesian information criterion, or BIC for 
short m ), and A = alogp (a > 1) the high-dimensional BIC [32]. Our conclusion on the 
MCMC complexity of Bayesian variable selection with Zellner’s g'-prior applies to BIC in the 
low-dimensional regime where p = o{n), and to high-dimensional BIC in the high-dimensional 


regime where p > n. Because of the Bayesian interpretation for 7 r„( 7 |y) in equation (42), we 
will focus on this posterior distribution over the model space 


A.2 Example of slow mixing 

Suppose p = n and g = p‘^°‘ with a > 1. Let Y = w ^ We claim that if 

we use the untruncated distribution TTnij) = as the prior for the variable-selection 

indicator vector over the entire space {0,1}^, then the mixing time of the Markov chain with 
transition probability specified by formula ([^ grows exponentially in n with probability at 
least 1/2 with respect to the randomness of w. Moreover, it is easy to check that this example 
satisfies the conditions in Theorem which imply Bayesian variable-selection consistency. 
As a consequence, this example suggests that although a size constraint I 7 I < sq such as the 


one in the sparsity prior (5d) is not needed for Bayesian model selection consistency, it is 


necessary for MCMC to mix rapidly. 


Proof of slow mixing: We use the following conductance argument: for any reversible 
Markov chain C over a finite state space, the spectral gap is upper bounded as 


1 — A 2 < 2<hc, 


where the quantity 


<I>c := min ‘he(A), where <hc(A) := ^ 

Ac^-.0<n{B)<l ^ ^ ^ 7r(A)7r(A7 


(43) 

(44) 


is called the conductance of C m- 

Now we analyze the mixing time of the Markov chain in the previous example. Use the 
notation 1 to denote the full model. Under the prior choice in the theorem, the posterior has 
an expression as 


'Kn{l\Y) OC 


( 1 + 5 ) 


-I 7 I /2 


pi 


for 7 e { 0 , 1 }P. 


(l+5(l-i?2))r 

Now we apply inequality (43) and equation ( |44| ) with B = {1} to obtain 

2 ELi ^n(l I Y)P{1, !_,) _ 2 Er=i 1-. 


(45) 


1 - A 2 < 2<hc < 


7 r„(l I U)(l- 7 r„(l I Y)) 


1 - vr„(l I U) 


where we have used the fact that under the transition probability specification ([^, the only 
“neighbor” of 1 is 1-j for j = 1 ,..., n, i.e. 7 (/ 1 ) satisfies P(l, 7 ) > 0 if and only if 7 = l_j 
for some j G {1,..., n}. Using ([^ and the last display, we can further obtain 


1 — A 2 < 


^(Liminjl, 


7rn(l-j|y) ' 
7r„(l|y) . 


n(l - 7 r„(l I Y)) 


(46) 
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We consider the numerator of the right-hand side in equation (46) first. Since the true 
model is the null model, we have 




7r„(l-j I Y) _ _ 

7rn(l I Y) (^1 + gw'^{I — P-j)w/{w'^w))^^‘^ 


(47) 


where P-j is the projection onto W_j. Since {w'^{I — P-j)w}^^i are random variables, 
by the union bound and the tail probability of distribution, we have that for constant ci 
sufficiently small, 

min w'^il — P-i)w > —> 1 — n • 
j=i,-,n n^J 4n 4 

Moreover, by a standard tail bound for the Xn distribution (uni, Lemma 1) we have 


w w < -n > -. 
- 2 J - 4 


Combining the last two displays, we obtain 

— P-j)w 


mm 




- n'>) - 2 ’ 


where C 2 = 2ci/3. Combining the above with equation (47), we obtain 

7r„(l_j I Y) 


mm 




.j=l,...,n 7r„(l I Y) 

where we have used the inequality 1 -|- x < for x G M and C 3 is some universal constant. 


(48) 


Now we consider the denominator of the right-hand side in (46). Recall that 0 denotes 


the indicator vector in {0,1}^ associated with the null model. By equation (45), we have 

7rn(l|y) _ (1 + g)-"/^ 

7rn(0 I Y) 

for n > 2. This implies 


n" 


(l + g)n/2 


= n 


~ 2 ’ 


vr„(l|y) = 


vrn(l I Y) 


^ vr^llT) ^ 1^ 


7r,,(0|y) 


(49) 


Combining equations (46) 


and (49) yields 


1 - A 2 < 2e-'=39n-2^ > 


which completes the proof of the claimed result. 


B Proof of inequality (23a) in Lemma 


Since 7 7 ^ 7 *, we know that 7 ' : = ^( 7 ) 7 ^ 7 . We divide the proof into the following three 
disjoint cases: 

• model 7 is overhtted, 

• model 7 is underfitted and unsaturated, 

• model 7 is underfitted and saturated. 
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B.l Case 7 is overfitted 

Let be the index selected from the set 7 \ 7* of uninfluential covariates in our construction 
of the transition function i.e. 7' = 7 \ {^7}- We can express the posterior probability ratio 
as 


7r„(7 I F) _ 1 + 5(1 - 

I Y) + 9 ^ 1 + ff(l - -R7) ^ 

= 1 . (i , R'-Ry w/2 

p'^^yTT~g V g-^ + {l-R^)J 


Since all influential covariates are included in models and M^/, we have 

||(/-$^)(X(^yc/3* ..+u;)||i ||(/-cl>y)(X(^yc/3* ).+u;)||2 

l-R^ = -- and 1-Ry = -—^-. 


Applying the Cauchy-Schwarz inequality yields 

i||(/- <h>||i - 11(1 - 


l-i?7 > 




W II (L ‘h^)rc||| 2||X(^*)c/3*^*^c||2 (“) ||(/— <f>^)rt;||| — 2 ZcJq logp 


2riii 


2riii 


where in step (i) we used the fact that the projection is a non-expansive mapping and in step 
(ii) we used the assumption ||^(7*)c/3(ty*)c|l2 ^ La^logp. Similarly, since 7' C 7, we obtain 
the following inequality for the quantity R'^ — Ry 

2 p2 _ “ ‘^7')(^(7*)'=/^(7*)': + ^)ll2 ^ 2||(<1>^ - <hy)rt;||2 + 2L(Tq logp 

R^ - Ry — ||y||2 — ||y||2 • 

On the event An n n C^, we have 
1 3n 

11(7 - $7)u;||i > -n-rsologp> y, and ||(^>7 - ^>y)u;||2 < (I 7 I - It'D^^-q logp = Luo logp, 

where we have used the assumption 4r sq logp < n. Combining these two inequalities with the 
preceding two displays, we obtain that the posterior probability ratio on the event Anf~^Bn^Cn 
is bounded as 


7 rn(7 I Y) ^ 1 _ F 2(L + L)logp /2 
I Y) - p>^^ V n/8 

I .exp|L<7±71hii; =yi+£)-i-«-. '17-7 

p'^\/g 71 2 J 


where in step (i) we used the inquality 1 + x < for x G M and the last step follows since 
X p°‘ with a > 8(7y + L) + 1 — k according to our choice of the hyperparameter, which 
completes the proof of the overfitted case. 
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B.2 Case 7 is underfitted and unsaturated 

This case happens only when s* > 1 . Let be the index in our construction of the transition 
function i.e. the index from the set 7* \7 that maximizes \\^ over j G 7* \7. 

Then we have 7' = 7 U {^7}, implying that $7/ — <^7 is a projection operator. Therefore, we 
can write 


1 - - (1 - Rt,) = 


2 ^ - ^^7)^ 


| y ||2 

M II2 


I($ 7 / - + ($y - + ($7 - ^> 7 )u;| 


| y ||2 

M II2 




> 


7 / - $ 7 )X 7 */ 3*.||2 - ||(< 1 > 7 ' - <f'7)X7*)c/3^ty.)c||2 - ||(‘f’7' “ ‘&7)w;||2)" 

IM II2 


By the / 3 n,in-condition and Lemma at the end of this appendix, we have 


II ($ 7 / - > V— - > ^^2 "^7^\7"2 ^ ^‘^Cfiallogp, (51) 


where : = co(L+L+a+K) denotes the coefficient in the / 3 n,in-condition of the theorem. The 
last display shows that at least an amount of u'^CpaQ logp variation in the true signal X^*f 3 *t 
can be explained by adding the influential covariate into the current model 7. Combining 

the above two displays, we obtain that for n sufficiently large, so that > 2 y/L + L, the 

following holds under the event An 


l-Bl-il- Rl) > 


nv 


7*\7 


4s*||y 


Similarly, we have that under the event Cn, 


, .2 _ r^(i-^y)r 

^ ' iiy||2 

IM II 2 

^ (11(1 - $7)^7*/?;. II2 + ||($7 - 4 > 7 )X( 7 )c/ 3 (;.)J |2 + 11(1 - ‘h 7 )u;|| 2 )' 

“ iiyi|2 

17 II 2 

^ 2||(/ - Hi + dlKchy - $7)X(77/3(;.7||i + 4||(I - ci>7)u;||i 

“ iiyi|2 

17 II 2 

W 2||(/ - 4 « 7 )X 7 */ 3 ;* Hi + 4 L^ 7 i logp + Snug ^ 4H(/ - Hi 

nil “ nii 


(52) 


where in step (i) we have used Assumption A, the fact that H(L — 4>7)rcH2 — ll^lli) the 
last step uses 2 H(/ — ^■^)X^*I 3*,\\'2 > 2ni/H/3**^..|,H2 ^ 4 L(Tq logp + SncjQ for Cfs > { 4 L + 3 )/i 7 . 
Combining the above two displayed inequalities, we obtain that for sufficiently large, so 
that v'^CjS > 64 (a+K+ 3 ), the posterior probability ratio under the event A„nC„nPn 
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satisfies 


T^nil I y) 


= p^y/l+g- (^1 




T^n{i\Y) V \\Y\\l/g + YT{I-^^)Y 

< - fv— 

-P Y 9 Ana^/s*+Aiy\\(l 


(i) 

< 


K rr— f-, ■ f ^ u^Cfslogp'^Y^^ 

(**) , - / 2\^/2 

< 1 + 5 • (^1 - (a + K + 3) logp • -j 

<p^ ■ p^ ■ p“("+^+3) 


= P 


-3 


n/2 


(53) 


where in step (i) we have used the inequality a/(5 + a) > min{l/ 2 , o/(26)} for any a,b > 0 


and inequality (51), and step (ii) follows by our assumption on Cg and Assumption D on s* 


B.3 Case 7 is underfitted and satnrated 

This case happens only when s* > 1. Let and be the indices defined in the construction 
of ^( 7 ) in the underfitted and saturated case. Then we have 7 ' = 7 U {j^} \ {k^}. Let 
vi = and V 2 = “ ^ 7 ')^ 7 */^ 7 *- Then Lemma stated and 

proved in Appendix |B.4t guarantees that 


2 > 


IU 2 II 2 < nuj(X)- 


It* \7l 

II /O* ||2 

So - s* 


> nu 


3 * II 2 
7*\7ll2 


and 


< -niy 
- 2 


3* II2 

7*\7ll2 


1 ||(/-$^)A^*/3; 


(54) 


< -ly 
- 2 


It* \7l 


under Assumption D on so- This inequality shows that a larger proportion of the true signal 
X^* fj*, can be explained when the unimportant covariate X^^ is replaced with the influential 
covariate Xj^ in the current model 7 . By letting w = w + be the effective noise, 

we have 


1 - - (1 - = 


y’'(<i.y - fov _ y"’(<i',.u(„) - <i>7)y - r^(4',uo,) - 


| y ||2 

7 II2 


| y ||2 

7 II2 


I + 2ufh; + u)'^(4>^u{y} - ^ 7 )^ “ {ll^2||i + 2vlw + - 4'y)'u}| 


ITIP 
7 II2 


> 


| y ||2 

7 II2 


{ikilb (||^'l||2 - 2||(4>.^u{y} - 4>7 )u;||2) - IIU 2 II 2 • (IIU 2 II 2 + 2||(4>^u{y} - ^7')^ll2) 

“ ll(‘^ 7 U{i.y} “ ‘f’7')^ll2|! (55) 


where in the last step we applied the Cauchy-Schwarz inequality to the two cross terms 
ViW = uj’(<f>..),u{j } — <h-y)r(; and v^w = uf(<l>..yu{j } — <l>y)t(;. Note that under the event An, 
the Cauchy-Schwarz inequality guarantees that 

ll(‘^7U{y} “*^7)^112 < 2|| (4>..yu{y} “‘^7)^112 + 2||A(.^*)c/3(*.^*y||2 < 2 (L-|-Z)(To logp, and 

||(‘^7U{y} “ ‘f*7')^ll2 ^ ^ll(‘^7U{y} “ ‘^7')'“^ll2 + {'y*YA{'y*)A\\ ^ ^(L -|- L)aQ \ogp. 
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Let : = 




• l|2 

7*11 


> ni 2 ^minjg-y* |/3tp > i^^C/^ctq logp. Then, for large enough 


|7*\7l ^ ^ “““Jt7' If-jl 

SO that > 32(L + L), we have, by the /3min-condition and the preceding display, that 

under the event An 


A A 

ll(^7U{T} - ^ 7)^112 < y and ||(^>7u{i^} - ^ 7 ')w^ll 2 < w- 


By the definition of A, we can also write inequality (54) as 


> A and lltJoll < —;=■ 

" " - V 2 


(56) 


(57) 


By plugging in the bounds (|56|) and (|57|) into inequality (|55|), we obtain 

1 - - {1 - R^y) > 


A-{A- A/4:) - {A/V 2 ) ■ {A/y /2 + A/4) - A^/IQ 


| y ||2 

M II 2 


> 


^2 


snii’ 


Combining this with inequality (52), we obtain that for Cp sufficiently large so that v'^Cp > 
192, the following holds under the event An H C,i n Vn 

^n{l\Y) _/ ||F||i(l-i?2-(l-i^2^))^^/2 

I y) 


< 1 - 


< 


\\Y\\l/g + Y^{I-^^)Y 
v\\{l-^^)Xy^;4l/{Ss*) ^n/2 

4nal/s* + 4|| (/ - ||| 




where the last two steps follows by the same argument as for the steps (i) and (ii) in inequal¬ 


ity (62). 


B.4 Lemma and its proof 

Recall the definition of j^y, ky and 4 in the construction of the transition function Q after 


Lemma 


The hrst result in following lemma shows that at least an amount of 
variatioiLin the true signal can be explained by adding Xj^ into the current model 

7 . The second result shows that removing Xj.^ from the model 7 U {jy} incurs a loss in 
the explained variation of at most na;(X)||/3*.y^|||/(so — s*). As a result, if sq satisfies 
the condition sq > {2v~‘^uj{X) 4- l)s* in Assumption D, then it is favorable to replace the 
unimportant covariate X^^ with the influential covariate Xj^ in the current model 7 . 

Lemma 8. Under the conditions and notation of Lemma^ we have: 

(a) If j is underfitted, then 


\^yU{j^}Xy*l3*y4l - \\^yXy*^*y4l > UV 


2 Il'^ 7 *\ 7 ll 2 


(b) If'j is underfitted and saturated, then 


l‘^7U{y}A^7*/^7*II 2 “ ll‘^7U{i.y}\{fc.y}A^7*/^7*II 2 ^ nuj{X)- 


II 2 

So - s* 
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Proof. For each £ G 7* \ 7, Lemma yields 

{I - <^.y)XeXj (/ - $^)x^*/ 3 ;* 


Xf{l-<^>^)Xe 




where the first step follows by the idempotence of projection matrices and the last step follows 
by the normalization condition in Assumption B. By summing the preceding inequality over 
£ € j* \ j, we obtain 


(||ci>^u{n^7*^;*lli-ll‘^7^7*/3rlli) 

fe7*\7 


> (I - ^ 7 ) — - ‘^ 7 )^ 7 */^;* 

(i) 


>I^(/3;.) ^7*(^-^7)^7*/37 


(ii) 




||2 

Il2) 


where in step (i) we used the fact that the vector (l — belongs to the column 

space of X-j,*u7 and applied Lemma and step (ii) follows by applying Lemma Since 
maximizes over ^ G 7* \ 7, the preceding inequality implies 


I^7U{j7}^7*/^7* II2 




> 


Il/57*\7ll2 

| 7 *\ 7 | 


> nu'^ 



l|2 

II 2 


This proves the first claimed inequality. 

Denote the subset 7 U {j^} by 7. For any 7' G denote by j 3 {'^') the least-squares 
solution to the problem 


min 

/ 3 eIRP,/ 3 j= 0 ,i ^7 




( 58 ) 


Given this definition, some simple linear algebra leads to 

||A^(7') - = 11(1 - $y)^7‘\7/5;*\7ll2- (59) 

Since ^7*, we have 

= ||(«F7-a>7\{^|)A7./3;*||i 

= 11(1 - ‘I>7\{fc,})X7*\7/3;*\~||2 - 11(7 - $7)X7*\7/3;*\~||i 

||A^(7 \ {fc^}) - X7*\7/3;.\~||2 - II A^(7) - A7.\7/3;*\~||i, 
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where in step (i) we used the fact that for /c ^ 7 *, and step (ii) 

follows by equation (59). This shows that the second claimed inequality is equivalent to 

wmi \ {^ 7 }) - ^7*\7/5;*\7ii2 - - ^7*\7/5;*\7iii ^ 

5o S 

We use to denote the jth component of /^(y). By the optimality of /^(y) for the least- 

squares problem (58), we have 

Xl{X^ ^( 7 ) - = 0 , for all G 7 - 

Therefore, for each fc G 7 , we have 

\\XKl \ {k}) - ^7*\7/3;*\7ll2 = - ^7*\7/5r\7 " ^^^2 

= \\xpij) - + wxj.ml 

< ||X^( 7 ) - + n|^fc(7)P, 

where in the last step we used the optimality of /3(7\ {A;}) and the normalization assumption 
ll^fclli = Then, by the definition of as the index in 7 \ 7 * that minimizes ||X^*/ 3*||2 — 
Il^ 7 \{fc}^ 7 */^ 1 l 2 = \\^Ml\{k}) - we have 

\ {k^}) - ^r\7/5;*\7ii2 =ii^^(7 \ {k}) - ^7‘\7/3;‘\7iii 

< - ^7*\7/^7*\7ll2 + IA(7)P 

2 
2 


<||X/3(7)-W^*\7/3;.\~||^ + n^ 






n- 


so - s" 


where last step follows since I7 \ 7*| = so ~ s* by the saturation of 7. By our definition and 
Assumption D, 


<u:{X)\\f3;.^jl 

Combining the last two displays yields the second claimed inequality. 


□ 


C Proof of Lemma 0 

We divide the proof into two cases: 7 is overfitted and underfitted. 

Case 7 is overfitted: Let /c = I7 \ 7*| be the number of unimportant covariates selected 
by 7. Since 7 7^ 7*, we have A: > 1 . Then, we can express the posterior probability ratio as 

7 rn (7 I ^ 1 _ / l + ff(l-fi^.) Nn/2 

'^n{'y*\Y) V 1 g(l - i?^) / 

= 1 . (, , ^7 --^7* 

pKA:(l _|-g,)A:/2 \ g,-l _|_ (1 _ ^2^ / 
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Since all influential covariates are included in models and M^*, we have 


i-r; = 


^ — \ » / ^j n — \ I / 


iyi|2 

M II2 


and 1 — R^* = 


iyi|2 

M 112 


Applying the Cauchy-Schwarz inequality yields 
i||(I - <h>||i - 11(1 - 


1-i?" > 


| y ||2 

M II2 


W II ‘^7)'*^ll2 2||X(^*)c/3*^*^c||2 W ||(/— <l>^)r (;||2 — 2ZcrQ logp 


2nii 


2nii 


where in step (i) we used the fact the projection is a non-expansive mapping and in step (ii) 
we used the assumption ||X(^*)c/3^*^*^c||2 — -^'^o^ogp. Similarly, since 7* C 7, we can obtain 
the following inequality for the quantity — R^ 


d 2 _ p2 

•L ^ 


||($^ - + u;)||^ ^ 2||(^'^ - 


|y 


+ 2 Lan logp 


II" 


Under the event An Ci Bn ri Cn, we have ||(/ — $-y)t(;||2 > — rKs* \ogp > ^ and ||(<h^ — 

‘^7*)'^ll2 — ^Acjq logp, where we have used the assumption 4 r Ks* \ogp < n. Combining these 
two inequalities with the preceding two displays, we obtain that the posterior probability ratio 
under the event An n n is bounded as 


I Y) ^ 1 _ A 2 (L + L)logp /2 

7r„(7* I Y) ~ p'^^{l + g)^A V n/8 


(d 1 

— pKk(^-^ _|_ g'^k/2 P 


16 (L + L) logp n-i ^ 8(L+Z)-fc(a+«) 
n 2 J ^ 


(a) 

< p 


- 2 k 


( 60 ) 


where in step (i) we used the inquality 1 + x < e* for x G M and step (ii) follows since x 
with a > 8 {L + L) + 2 — k according to our choice of the hyperparameter. This proves the 
first part. Now we consider the underfitted case. 


Case 7 is underfitted: This case happens only when s* > 1 . Let 7 = 7 U 7*. Denote 
fe = I7* \ 7I and Z = I7I, then I7 \ 7I = k, I7 \ 7*| = k + £ — s*, and I7I = A: + Z < {K + l)s*. 
Since 7 C 7, we can write 


1 -R^^-{1-R^) = 




|y||2 

U II2 


|(«Ly - + ($7 - + (4>7 - <^^)w\ 


|y||2 

U II2 


($; 


> 


- <i>^)x^*p ;42 - iK^by - <f>.,)A(.,.)./ 3(;,).||2 - ||($7 - ^7)^112) 

liy||2 ■ 

lu II2 

By the ;0n,in-condition and Lemma we have 

IK^hy - Hi = 11(1 - 4 )^)x^./ 3 ;*|ii > n 4 ||/ 3 ;*\.,|ii > kal logp, (61) 
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where Cp ■.= cq{L + L + a + k) denotes the coefficient in the /S^in-condition of the theorem. 

Consequently, as long as cq is sufficiently large, we can ensure that > 4^ L + L, and 

hence, under the event An, we find that 


1 - - (1 - > 


{I 


* 11^ 


4rii^ 


Similarly, under the event Cn, we have 


l-Ri = 


< 


| y ||2 

M II 2 


il 


- II 2 + ||(4>y - + 11(1 - ^^)u^|| 2 )' 


iyi|2 
M II 2 


< 


2||(/ - ^^)X^,(5;A\i + 411(^7' - + 4||(I - ch^)u;||2 


| y ||2 

M II 2 


W 2||(/-$^)X.^*/3;.||2 + 4Lc7glogp + 3nug ^ A\\{l - I3*.\ 


| y ||2 

M II 2 


| y ||2 

M II 2 


where in step (i) we have used Assumption A, the fact that ||(/ — 4>..y)r/;||2 < H^r'llii the 
last step uses 2||(/ — ^^')X^*j3** p > 2nz/||/3**y^||| > 4 L(Tq logp + SncjQ for > (4L + 3)/z/. 

Consequently, as long as Cp is large enough so that > 64 (a + k + 8(L + L) + 2), the 
posterior probability ratio under the event An H C„ n is upper bounded as 


T^njl I Y) _ 
I Y) 


|y||i(l-i?2_(l_i?2^))^„/2 

\\Y\\yg + YT{I-^^)Y 


= + gf/"^ ■ (^1 - 

^ V 4nu2/s* , * t w 112; 


n/2 


(d 


<p"^(l+ff)"/2.(l-min{-. 


4nu2/s*+4|| (/-$.,) A.,./?; 

1 Z^C/JS* logP|^^/2 


32n 


(m) 

< p 


(1 + 5)^/^ • - (a + K + 8(L + L) + 2)s* logp • 

‘ (q^+^+ 8(L+Z)+2) ^ ^(/t+Q)(/c—s*)—8(L+L)—2 


2 \n /‘2 


S P 


(62) 


where in step (i) we have used the inequality a/{b + a) > min{l/2, a/(26)} for any a,b > 0 
and inequality ( [M| ) , and step (ii) follows by our assumption on C/s and the assumption s* > 1 
made at the beginning of this underfitted case. 

Since model M;y is overhtted, by the intermediate result (60), we have that under the event 
An n Bn Cl Cn 


= P 


TTnil* I Y) 

Combining the last two displays, we obtain that under the event An C 13^ C C„ n 


I Y) 


^^-(n+a)e-2 ^^-2e- 


TTniY I Y) 

where in the last step we have used Assumption C. 


SP 
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